Patent abstract:
An image of a printed document portion is provided to a synchronizer. The synchronizer retrieves an electronic version of the printed document and identifies an electronic text portion that is textually similar to a printed text portion. The synchronizer detects an annotation in the printed document portion and inserts a corresponding digital annotation into the electronic document.
Publication number: ES2555180A2
Application number: ES201590093
Filing date: 2014-02-27
Publication date: 2015-12-29
Inventor: Aaron Cooper
Applicant: Thomson Reuters Global Resources ULC
Primary IPC:
Patent description:

DESCRIPTION
Computer-implemented method for synchronizing annotations between a printed document and an electronic document, computer-readable medium, and corresponding system.
Cross-reference to related applications
The present application claims priority to US non-provisional application No. 13/781,446, filed on February 28, 2013, which is expressly incorporated herein by reference in its entirety.
Background
The popularity of electronic books and other electronic documents has increased among readers, such as legal professionals, medical professionals, students, and others, in recent years. Some of these readers maintain a printed version of a document as well as an electronic version of it. Many readers have become accustomed to making annotations in printed documents: highlighting and underlining portions of text, writing notes in the margins, writing notes between lines of printed text, crossing out printed text, and the like. Although conventional electronic books and other electronic documents sometimes allow a reader to add digital annotations directly by means of a reading device, the reader must normally continue to refer to the printed version of the document in which the annotations were initially made, since these electronic document platforms do not allow synchronization of annotations between a printed document and an electronic version of the printed document.
Summary
Embodiments of the present invention disclosed herein facilitate the synchronization of annotations between a printed document and an electronic document. In embodiments, an image of a portion of a printed document is provided to a synchronizer. The synchronizer can retrieve an electronic version of the printed document and identify a portion of electronic text that is textually similar to a portion of printed text. According to embodiments, the synchronizer detects an annotation in the printed document portion and inserts a corresponding digital annotation into the identified, similar portion of the electronic document.
Brief description of the drawings
Figure 1 is a block diagram illustrating an operating environment (and, in some embodiments, aspects of the present invention) in accordance with embodiments of the present invention;
Figure 2 is a schematic diagram depicting an illustrative operation of a synchronizer in accordance with embodiments of the present invention;
Figure 3 is a schematic diagram depicting an illustrative pruning operation in accordance with embodiments of the present invention;
Figure 4 is a flowchart depicting an illustrative method of synchronizing annotations between a printed document and an electronic document, in accordance with embodiments of the present invention; and
Figure 5 is a flowchart depicting an illustrative method of detecting annotations in a portion of a printed document, in accordance with embodiments of the present invention.
Although the present invention is open to various modifications and alternative forms, specific embodiments are shown by way of example in the drawings and are described in detail below. However, the present invention is not limited to the particular embodiments described. On the contrary, the present invention is intended to cover all modifications, equivalents, and alternatives that fall within the scope of the present invention as defined by the appended claims.
Although the term "block" may be used herein to connote different elements of the illustrative methods employed, the term should not be construed as implying any requirement of, or particular order among, the various steps disclosed herein, unless explicit reference is made to the order of individual steps, and only in that case.
Detailed Description
In embodiments, a reader, such as a student, a doctor, or a lawyer, may want to keep a printed version of a document and an electronic version of it at the same time, or may want to switch to using only an electronic version of a document after using the printed version. The printed document may include any number of annotations, such as, for example, highlighted and underlined portions, handwritten notes, bookmarks, and the like, which the reader may wish to have available in the electronic version of the document. Embodiments of the present invention include a synchronizer that facilitates the synchronization of annotations, such as those mentioned, between the printed document and the electronic version, so that the annotations can be available to the reader in the electronic version of the document. For example, the reader can create images of annotated portions of the printed document (for example, using a camera or scanner) and provide those images to the synchronizer, or the reader can provide the printed document to a scanning service (which, in embodiments, may be associated with the synchronizer). The scanning service can use a scanning device to create images of the annotated portions of the printed document and provide the resulting images to the synchronizer. Additionally, in embodiments, the synchronizer can facilitate navigation of an electronic version of a printed document, for example, to facilitate the direct insertion, by the reader, of digital annotations into the electronic version. For example, a reader can provide an image of a portion of the printed document to the synchronizer, which uses the image to locate and display a corresponding portion of the electronic document, into which the reader can directly insert a digital annotation.
Figure 1 depicts an illustrative operating environment 100 (and, in some embodiments, aspects of the present invention) in accordance with embodiments of the present invention, as illustrated by way of example. As shown in Figure 1, embodiments of the operating environment 100 include a server 102 that provides a copy 104 (labeled "electronic document copy") of an electronic document 106 (labeled "electronic document") to a reading device 108 and that hosts a synchronizer 126, as described in more detail below. In embodiments, the synchronizer 126 may be hosted by the reading device 108 or another computing device, and the server 102 may simply act as a repository for the electronic document 106.
As the name implies, the reading device 108 is, in embodiments, what the reader uses to read electronic documents, and includes a display module 110 on which the copy 104 of the electronic document 106 can be displayed. In embodiments, the electronic document 106 and the copy 104 are electronic versions of a printed document (not shown) and may be in their original form, without annotations, or in an annotated form. An electronic version of a printed document may include the same content as the printed document or substantially similar content, although it may also include different content. For example, an electronic version of a printed document may include an updated publication (e.g., edition) of the printed document, an annotated version of the printed document, and the like. Examples of documents (both printed and electronic) include books, articles, court decisions, statutory compilations, treatises, footnotes, reference notes, translation notes, and the like.
In embodiments, a user (not shown) downloads the copy 104 of the electronic document 106 using the reading device 108 by accessing the server 102 through a communications network 112 such as, for example, a local area network, an enterprise network, the Internet, or the like. The copy 104 of the electronic document 106 may also be provided to the reading device 108 by means of a removable memory device such as, for example, a compact disc, a flash storage unit, or the like. According to embodiments, the electronic document 106 (and the copy 104 thereof) can be embodied in one or more files using any number of formats such as, for example, DjVu, EPUB®, FictionBook, Kindle®, Microsoft® Reader, eReader®, Plucker, plain text in ASCII, UNICODE, markup languages, or a platform-independent document format such as the Portable Document Format (PDF), and the like. The electronic document 106 (and/or the copy 104 thereof) can also be embodied in the Thomson Reuters ProView® format, available from Thomson Reuters of New York, New York. Examples of markup languages, and corresponding markup language files, include Hypertext Markup Language (HTML), Extensible Markup Language (XML), Extensible Hypertext Markup Language (XHTML), and the like.
As shown in Figure 1, the reading device 108 includes a processor 114 and a memory 116. According to embodiments, the reading device 108 is a computing device and can take, for example, the form of a specialized computing device or a general-purpose computing device, such as a personal computer, a workstation, a personal digital assistant (PDA), a mobile phone, a smartphone, a tablet, a laptop computer, or the like. An electronic book reader component 118 is stored in the memory 116. In embodiments, the processor 114 executes the electronic book reader component 118, which can cause at least a portion of the copy 104 of the electronic document 106 to be displayed in the display module 110. The electronic book reader component 118 can also facilitate other operations and interactions associated with the copy 104, such as, for example, insertion of digital annotations, searching, bookmarking, and the like, as explained in more detail below. In embodiments, the electronic book reader component 118 can access the server 102 to cause at least a portion of the electronic document 106 to be displayed in the display module 110.
As shown in Figure 1, the server 102 includes a processor 122 and a memory 124. The synchronizer 126 may be stored in the memory 124. In embodiments, the processor 122 executes the synchronizer 126, which may facilitate navigation of an electronic document 106 (or a copy 104 thereof, for example, by interacting with the reading device 108) and synchronization of annotations between a printed document and the electronic document 106 (and/or a copy 104 thereof). The electronic document 106 may be stored in content storage media 128 in the memory 124. In embodiments, to facilitate synchronization of annotations, an image 107 of a portion of a printed document (referred to herein as a "printed document portion") is provided to the server 102, which can also store the image 107 on the content storage media 128.
According to embodiments, a printed document portion may include one or more annotations in the vicinity of a portion of printed text (referred to herein as a "printed text portion") and may be contained in one or more pages of one or more printed documents. The annotations may include, for example, highlighted and underlined portions, handwritten notes in a margin or between lines of printed text, bookmarks, sticky notes, and the like. Printed text may include, for example, one or more chapters; one or more paragraphs; one or more lines; one or more words; one or more characters; one or more portions of a chapter, paragraph, line, word, or character; or the like. In addition, a printed text portion may include a passage of text and one or more footnotes, endnotes, figures, tables, or the like.
According to embodiments, the image 107 of the printed document portion can be created, for example, using a camera 120 integrated with the reading device 108 (for example, when the reading device is a smartphone or tablet computer) and can be provided to the server 102 via the communications network 112. For example, a reader can use the camera 120 to photograph (i.e., create an image 107 of) an annotated portion of a printed document and communicate the image 107 to the server 102, together with a request that an electronic version 106 of the printed document be modified to include digital annotations corresponding to the annotations shown in the image 107. The server 102 may respond to the request by inserting the digital annotations and providing a copy 104 of the electronic document 106, which has the corresponding digital annotations, to the reading device 108, or by instructing the electronic book reader component 118 in the reading device 108 to insert the annotations.
In embodiments, the image 107 can also be created using a scanning device 130, such as an industrial scanner, or any other type of imaging device (not shown). For example, an individual or an entity (for example, a library, a school, a law firm, or the like) can provide annotated printed documents to a service provider associated with the server 102. The service provider can use an industrial scanner 130, for example, to scan large quantities of documents, complete books, or the like, and to provide the images 107 created by the scanning process directly to the server 102. In embodiments, a scanning service may use a scanning device 130 and provide the resulting images 107 to the server 102 via the communications network 112.
According to embodiments, to facilitate the synchronization of annotations, the synchronizer 126 retrieves, or alternatively accesses, at least a portion of the image 107 from the content storage media 128, and retrieves, or alternatively accesses, at least a portion of an electronic version (for example, the electronic document 106) of the printed document. The synchronizer 126 identifies a portion of electronic text (referred to herein as an "electronic text portion") of the electronic document 106 that corresponds to the printed text portion captured in the image 107. According to embodiments, an electronic text portion corresponds to a printed text portion if the two text portions are textually similar. The expression "textual similarity" can refer, for example, to a degree of similarity between two portions of text and can be defined, for example, in terms of statistical measures, ratios, or the like. For example, two portions of text can be textually similar if they have a certain number (for example, by comparison with other adjacent text portions) of matching characters, character n-grams, or the like.
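The character n-gram matching described above can be sketched as follows. This is a minimal illustration, not the patent's actual algorithm; the function names, the use of Jaccard overlap, and the 0.6 threshold are assumptions introduced for this sketch.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams in a whitespace-normalized, lowercased string."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}

def textual_similarity(printed, electronic, n=3):
    """Jaccard overlap of character n-grams as one possible similarity measure."""
    a, b = char_ngrams(printed, n), char_ngrams(electronic, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def is_textually_similar(printed, electronic, threshold=0.6):
    """Declare two text portions similar if their n-gram overlap clears a threshold."""
    return textual_similarity(printed, electronic) >= threshold
```

Because the measure tolerates partial mismatch, an electronic text portion can still be matched when the electronic edition contains extra or updated text, which is the scenario the description returns to below.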
Additionally, in embodiments, the synchronizer 126 can be configured to analyze the image 107 in order to detect annotations in the printed document portion and can also be configured to interpret detected annotations. For example, the synchronizer 126 can detect an annotation within an image 107 of a portion of a printed document and can determine an annotation type (for example, highlighting, underlining, and the like) of the detected annotation. According to embodiments, one or more reviewers can be used (for example, by means of a crowd-sourcing model) to facilitate the detection and/or interpretation of annotations, as well as the creation, modification, and/or verification of digital annotations. For example, the server 102 may provide an image of a handwritten annotation to a review device 132 (for example, through the communications network 112), so that a reviewer can assist in the creation of the digital annotation by detecting, interpreting, and/or transcribing the handwritten annotation into digital text that can be searched using the reading device 108. In embodiments, additional reviewers can verify the first reviewer's interpretations of the handwritten annotation. Crowd-sourcing platforms can be used to interact with reviewers and can include crowd-sourcing platforms integrated with the server 102 or independent platforms such as, for example, Amazon Mechanical Turk®, provided by Amazon.com® Inc. of Seattle, Washington, US. Artificial intelligence algorithms can also be used to interpret, modify, and/or verify digital annotations.
The synchronizer 126 may insert a digital annotation corresponding to the detected annotation into the electronic document 106, and, in embodiments, the electronic book reader component 118 may insert the digital annotation into the copy 104 of the electronic document 106. For example, the synchronizer 126 may provide an instruction to the electronic book reader component 118 that causes the electronic book reader component 118 to insert the digital annotation into the copy 104 of the electronic document 106. Examples of digital annotations include, without limitation, an image of a detected annotation, electronic text, a formatting object (for example, code that causes a portion of the rendered electronic text to include a highlighted portion, an underlined portion, or the like), an electronic bookmark, a digital bookmark, a hyperlink, a set of instructions for modifying a portion of the electronic document 106 (or a copy 104 thereof), and the like.
Embodiments of the present invention can also facilitate various operations for managing digital annotations. For example, the synchronizer 126 and/or the electronic book reader component 118 can be configured to delete a digital annotation from an electronic document 106 (or a copy 104 thereof) that was previously inserted during a synchronization process and that is no longer found in the corresponding printed document. Additionally, the synchronizer 126 and/or the electronic book reader component 118 can be configured to distinguish between digital annotations that were inserted during a synchronization process (referred to herein as migrated digital annotations) and digital annotations that were added directly in the electronic document 106 (or a copy 104 thereof) (referred to herein as direct digital annotations). In embodiments, a digital annotation may include an attribute that specifically indicates whether the digital annotation was inserted during a synchronization process or was added directly by a reader. The attribute can be represented by a tag value associated with the electronic document 106, and the synchronizer can determine whether an annotation is a migrated digital annotation or a direct digital annotation by examining the value of the tag. In this way, embodiments of the present invention can facilitate the removal of migrated digital annotations while minimizing the unintentional removal of direct digital annotations.
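The migrated/direct distinction can be modeled as an attribute on each annotation, as in the following sketch. The class and field names are illustrative assumptions, not structures defined by the patent.

```python
from dataclasses import dataclass

@dataclass
class DigitalAnnotation:
    anchor: str    # the electronic text the annotation is attached to
    content: str   # e.g. a transcribed handwritten note
    migrated: bool # True if inserted during a synchronization pass

def prune_migrated(annotations, still_in_print):
    """Remove migrated annotations whose anchors no longer appear in the
    printed document, without ever touching direct annotations."""
    return [a for a in annotations
            if not a.migrated or a.anchor in still_in_print]
```

Checking the `migrated` flag before deletion is what keeps a reader's directly added notes safe during re-synchronization.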
Tag values can also be associated with particular annotation migrations. For example, a reader can place a first sticky note related to first content (for example, a first legal case) in a printed document. When an initial digital annotation corresponding to the first sticky note is inserted into an electronic version of the printed document, a tag value associated with the digital annotation can provide information about the migration instance. The tag value may indicate, for example, that the first digital annotation is related to the first case (for example, where the reader can specify the tag value), that it was created at a particular time, or the like. Subsequently, the reader could remove the first sticky note from the printed document and replace it with a second sticky note related to second content (for example, a second legal case). In embodiments, a second digital annotation corresponding to the second sticky note can also be inserted into the electronic document, and it can have an associated tag value indicating, for example, that it is related to the second legal case. The reader may be presented with an option to keep or delete the first digital annotation, to customize tag values associated with the first and/or second digital annotations, or the like.
Embodiments of the invention also facilitate the manipulation of digital annotations. For example, digital annotations can be stored as separate elements attached to the electronic document, which can be manipulated, or digital annotations can be integrated into the electronic document, which can be manipulated. Additionally, the synchronizer 126 and/or the electronic book reader component 118 can be configured to adjust the positions of digital annotations, such that new digital annotations do not hide existing digital annotations. In embodiments, the synchronizer 126 and/or the electronic book reader component 118 may allow a reader to search, manipulate, reposition, edit, delete, or otherwise manage digital annotations.
Embodiments of the present invention can also facilitate the selective display of digital annotations. For example, if the copy 104 of the annotated electronic document 106 is used during a court session, it may be desirable to present a section of the copy 104 to the opposing party, the judge, and/or the jury without displaying one or more digital annotations contained in it. Thus, the electronic book reader component 118 may include an option to hide one or more digital annotations when the copy 104 is displayed. To facilitate this, a digital annotation may include a tag that allows the digital annotation to be displayed or hidden based on the tag value. For example, a reader can specify that digital annotations having a certain tag value (for example, digital annotations related to a first legal case, as described in the previous example) are hidden. The tag value can also enable conditional formatting of a digital annotation. For example, the size, shape, file format, and/or layout of a digital annotation can be adjusted based on a tag value, which can be static or dynamic and can be assigned based on characteristics of nearby digital annotations, memory limitations, available screen area, reading device capabilities, or the like. Additionally, tag values can be generated manually or automatically. For example, the electronic book reader component 118 can assign a particular value to a tag based on an event or condition, and the electronic book reader component 118 can cause a selectable option (for example, a button or icon) to be presented in the display module 110 so that, upon receipt of a selection of the option, one or more annotations that would otherwise be displayed are not.
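The tag-based hide/show behavior described above amounts to filtering annotations against a set of hidden tag values before rendering. The following is a minimal sketch; the dictionary shape and the "case-1"/"case-2" tag vocabulary are hypothetical.

```python
def visible_annotations(annotations, hidden_tags):
    """Return only the annotations whose tag is not in the reader's hidden set."""
    return [a for a in annotations if a.get("tag") not in hidden_tags]

# Hypothetical example: hide everything tagged with the first legal case.
notes = [{"text": "see precedent", "tag": "case-1"},
         {"text": "key holding", "tag": "case-2"}]
shown = visible_annotations(notes, hidden_tags={"case-1"})
```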
According to embodiments, several components of the operating environment 100 illustrated in Figure 1 can be implemented on one or more computing devices. For example, each of the server 102, the reading device 108, the scanning device 130, and the review device 132 may be, or include, one or more computing devices. A computing device may include any type of computing device suitable for implementing embodiments of the invention. Examples of computing devices include workstations, servers, laptops, desktops, tablet computers, handheld devices, e-book readers, and the like, all of which are contemplated within the scope of Figure 1 with reference to the various components of the operating environment 100.
In embodiments, a computing device includes a bus that, directly and/or indirectly, couples the following devices: a processor, a memory, an input/output (I/O) port, an I/O component, and a power supply. Any number of additional components, different components, and/or combinations of components can also be included in the computing device. The bus represents what can be one or more buses (such as, for example, an address bus, a data bus, or a combination thereof). Similarly, in embodiments, the computing device may include several processors, several memory components, several I/O ports, several I/O components, and/or several power supplies. Additionally, any number of these components, or combinations thereof, can be distributed and/or duplicated across a number of computing devices.
In embodiments, the memories 116 and 124 include computer-readable media in the form of volatile and/or non-volatile memory and may be removable, non-removable, or a combination thereof. Examples of such media include Random Access Memory (RAM); Read-Only Memory (ROM); Electronically Erasable Programmable Read-Only Memory (EEPROM); flash memory; optical or holographic media; magnetic cassettes, magnetic tape, magnetic disk storage devices, or other magnetic storage devices; data transmissions; or any other medium that can be used to encode information and that can be accessed by a computing device, such as, for example, quantum state memory, and the like. In embodiments, the memories 116 and 124 store computer-executable instructions that cause the processors 114 and 122, respectively, to carry out aspects of the methods and procedures described herein. The computer-executable instructions may include, for example, computer code, machine-usable instructions, and the like, such as, for example, program components capable of being executed by one or more processors associated with a computing device. Examples of such program components include the electronic book reader component 118 and the synchronizer 126. Some or all of the functionality contemplated herein can also be implemented in hardware and/or firmware.
The illustrative operating environment 100 shown in Figure 1 is not intended to suggest any limitation as to the scope of use or functionality of embodiments of the present invention. Nor should the illustrative operating environment 100 be interpreted as having any dependency or requirement relating to any individual component or combination of components illustrated herein.
Figure 2 is a schematic diagram depicting an illustrative operation of the synchronizer 126 in accordance with embodiments of the present invention. As shown in Figure 2, the synchronizer 126 may include a detection component 202, a comparison component 204, and a digital annotation component 206. According to embodiments, any one or more of the components 202, 204, and 206 of the synchronizer 126 depicted in Figure 2 may share resources, or be integrated, with various other components depicted therein (and/or components not shown). Additionally, in embodiments, the operations of components 202 and 204 can be carried out in any order, cycle, combination, or the like. For example, the comparison component 204 may be used to identify an electronic document portion that corresponds to a printed document portion before the detection component 202 is used to detect an annotation in the printed document portion. Alternatively, the detection component 202 can be used to detect an annotation in a printed document portion before the comparison component 204 is used to identify the corresponding electronic document portion into which a corresponding digital annotation is to be inserted. Additionally, any one or more of the components 202, 204, and 206 may reside on the server 102 or the reading device 108, or may be distributed between the server 102 and the reading device 108.
According to embodiments, the detection component 202 detects an annotation 218 in a printed document portion 210. For example, the detection component 202 may receive an image 107 of the printed document portion 210 (for example, from the memory 124) and may perform one or more procedures to detect the annotation 218. In embodiments, the detection component 202 can also identify an annotation type corresponding to the annotation 218. Examples of annotation types include text, highlights, underlines, bookmarks, sticky notes, and the like. Any number of different types of procedures can be used to detect annotations and/or identify annotation types. Examples of such procedures include handwriting recognition procedures, optical character recognition (OCR) procedures, bitmap comparison procedures, statistical language models, statistical classifiers, neural networks, crowd-sourcing, and the like.
For example, the detection component 202 can analyze the image 107 (for example, by examining pixels or blocks of pixels) to establish patterns associated with the printed text portion 216 and to detect an anomalous feature, or features, that may represent the annotation 218. Examples of such anomalous features may include instances of different colors (for example, associated with a highlight), instances of irregular shapes and edges (for example, associated with handwritten notes or underlining), and instances of obscured text or geometric shapes having different shades of color (for example, associated with sticky notes, bookmarks, or page markers), and the like. For example, a portion of printed text may have been highlighted with a yellow highlighter, which can be detected as a yellow feature that partially overlaps a portion of printed text. In embodiments, the detection component 202 can use statistical techniques to determine a probability that a detected anomaly represents an annotation 218.
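The yellow-highlighter example above can be sketched as a simple color test over pixels. The RGB thresholds and the 30% coverage cutoff are illustrative assumptions; pixels are modeled as nested lists of [R, G, B] values to keep the sketch dependency-free.

```python
def is_highlight_pixel(rgb, min_rg=180, max_b=120):
    """Yellow-ish pixels have strong red and green channels but a weak blue channel."""
    r, g, b = rgb
    return r >= min_rg and g >= min_rg and b <= max_b

def highlight_fraction(pixels):
    """Fraction of pixels in a region that look like highlighter ink."""
    total = hits = 0
    for row in pixels:
        for rgb in row:
            total += 1
            hits += is_highlight_pixel(rgb)
    return hits / total if total else 0.0

def probably_highlighted(pixels, threshold=0.3):
    """Flag a region as highlighted when enough of it matches the ink color."""
    return highlight_fraction(pixels) >= threshold
```

A production detector would also account for lighting variation and other ink colors, which is where the statistical techniques mentioned above come in.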
According to embodiments, the detection component 202 can detect the annotation 218 in the printed document portion 210 by comparing the image 107 of the printed document portion 210 with at least a portion of the electronic document 106. For example, when the comparison component 204 is used before the detection component 202, the comparison component 204 can provide an indication to the detection component 202 of an electronic text portion 214 corresponding to the printed text portion 216. The detection component 202 can, for example, access, or create, a bitmap of a corresponding portion of the electronic document 106 and compare that bitmap with a bitmap (e.g., the image 107) of the printed document portion 210 to identify differences between the two bitmaps. A difference between the bitmaps can represent, for example, the presence of the annotation 218, which is not present in the electronic document 106. A difference between the bitmaps can also represent the presence of a digital annotation (not shown) in the electronic document 106 that is not present in the printed document portion 210.
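The bitmap comparison described above reduces to flagging pixels where the two renderings disagree beyond some tolerance. The following sketch models grayscale bitmaps as nested lists of 0-255 values; the tolerance and minimum-difference fraction are assumptions for illustration.

```python
def diff_mask(printed, electronic, tolerance=32):
    """Return a boolean mask of pixels where the two bitmaps disagree."""
    return [[abs(p - e) > tolerance for p, e in zip(prow, erow)]
            for prow, erow in zip(printed, electronic)]

def has_annotation(printed, electronic, min_diff_fraction=0.01):
    """Report a likely annotation when enough pixels differ between bitmaps."""
    mask = diff_mask(printed, electronic)
    total = sum(len(row) for row in mask)
    changed = sum(sum(row) for row in mask)
    return total > 0 and changed / total >= min_diff_fraction
```

In practice the two bitmaps would first need to be aligned and scaled to the same resolution, a step this sketch assumes has already happened.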
In embodiments, the detection component 202 may use an image zoning procedure to divide the image 107 into zones. An image zoning procedure may define one or more text zones corresponding to printed text, and one or more candidate zones corresponding to areas that may include annotations, such as, for example, portions of a margin. The image zoning procedure can define a margin as a region that is to the left or right, respectively, of the leftmost or rightmost text zone. In embodiments, a text zone can also be a candidate zone. A handwriting recognition procedure can be applied to a zone in an attempt to recognize handwriting. If the handwriting recognition procedure succeeds in recognizing handwriting within the zone, the detection component 202 may identify the recognized handwriting as an annotation. Recognition of the handwriting can include conversion of the handwriting into searchable electronic text, which can be inserted into the electronic document as a direct digital annotation.
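The margin rule above (candidate zones lie left of the leftmost text zone and right of the rightmost one) can be sketched as follows. Zones are represented as (left, right) x-extents in pixels; the representation and function names are assumptions for this sketch.

```python
def margin_zones(page_width, text_zones):
    """Given text-zone x-extents, return the left/right margin candidate zones."""
    if not text_zones:
        # No text detected: the whole page is a candidate zone.
        return [(0, page_width)]
    leftmost = min(left for left, _ in text_zones)
    rightmost = max(right for _, right in text_zones)
    candidates = []
    if leftmost > 0:
        candidates.append((0, leftmost))            # left margin
    if rightmost < page_width:
        candidates.append((rightmost, page_width))  # right margin
    return candidates
```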
In embodiments, a portion of handwriting may not be recognizable by the handwriting recognition procedure, or the zone may not include handwriting, and, as a consequence, the handwriting recognition procedure may not succeed in recognizing handwriting. Additionally, a handwriting recognition procedure may not be available or may be impractical (for example, due to memory limitations). In such cases, other types of analysis can be carried out on the zone to determine whether it contains unrecognizable handwriting or other types of annotations. For example, an OCR procedure can be applied to attempt to identify annotations within the zone (for example, by labeling features detected within the zone for which a correspondence with a textual character cannot be established by means of the OCR procedure), as described in more detail below with reference to Figure 5. In embodiments, a region detection procedure can be used to identify regions of homogeneous color superimposed on, or adjacent to, detected printed text, which may facilitate the identification of highlighted portions, underlined portions, and the like. Any number of statistical classifiers, heuristics, neural networks, and the like can be used to detect annotations, and such tools can be improved periodically or continuously using machine learning techniques. As indicated above, the detection component 202 can also determine the type of the annotation 218 it detects. For example, the detection component 202 may determine that a detected annotation 218 in the printed document portion 210 is of a particular annotation type based, at least in part, on lookup tables, features of the annotation 218, feedback from a reviewer, classifier outputs, and the like.
As indicated above, the comparison component 204 is configured to access at least a portion of an image 107 of a printed document portion 210 from a memory (for example, the memory 124 shown in Figure 1) and to access at least a part of an electronic document 106 from a memory. The comparison component 204 identifies an electronic text portion 214 in the electronic document 106 that corresponds to (for example, is textually similar to) a printed text portion 216 of the printed document portion 210. The comparison component 204 may use algorithms that incorporate OCR techniques, parsing techniques, and/or the like to identify an electronic text portion 214 that is textually similar to a printed text portion 216.
In embodiments, an electronic version 106 of a printed document may not include text identical to that of the printed document. For example, a printed text portion 216 may include a text passage and three corresponding footnotes, while the corresponding electronic text portion 214 may include the text passage and four corresponding footnotes (for example, the fourth footnote may present a recently decided case) where, for example, the electronic version is a more recent edition of the document. To facilitate the identification of corresponding text portions in a manner that takes such textual variations into account, algorithms used by the comparison component 204 can be configured to evaluate textual similarities between the text portions 214 and 216. As previously mentioned, a textual similarity may refer to a degree of overlap between recognized characters (for example, by means of OCR) in the printed text portion 216 and characters in the electronic text portion 214. For example, the comparison component 204 may use n-gram search techniques to identify a set of lines of an electronic document 106 that includes the largest number of n-grams that match n-grams of the recognized text of the printed text portion 216. In embodiments, the comparison component 204 may use an unlimited number of statistical comparison techniques to evaluate textual similarity.
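As an illustrative sketch (not from the specification) of the n-gram matching idea above, the following scores sliding windows of document lines by shared character n-grams; the window size, n-gram length, and scoring are assumptions:

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of a whitespace-normalized, lowercased string."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def best_matching_window(ocr_text, doc_lines, window=3, n=3):
    """Slide a window of lines over the electronic document and return the
    (start_index, score) of the window sharing the most character n-grams
    with the OCR'd printed text."""
    query = char_ngrams(ocr_text, n)
    best = (0, -1.0)
    for i in range(max(1, len(doc_lines) - window + 1)):
        grams = char_ngrams(" ".join(doc_lines[i:i + window]), n)
        score = len(query & grams) / max(1, len(query))  # fraction of query n-grams matched
        if score > best[1]:
            best = (i, score)
    return best
```

Because the score measures similarity rather than identity, the match survives edition differences such as an extra footnote.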
In some cases, the printed text portion 216 may include only parts of words and/or characters. For example, the image 107 may be captured in such a way that the entire printed document portion 210 is not captured within the image 107, or a Post-it note or other object may hide part of the printed text portion 216 when the image 107 is captured. In embodiments, exact text comparison can be complicated by inaccurate character recognition during OCR procedures. According to embodiments, the comparison component 204 may use character pruning techniques to facilitate the evaluation of textual similarities between text portions 214 and 216. Turning briefly to Figure 3, a printed document portion 302 having a printed text portion 304 is illustrated. When a user captures a region 306 of the printed document portion 302 using an imaging device, the captured region 306 may include only a fraction of the printed text portion 304, as illustrated. When the captured region 306 is digitized 308 to create an image 310, the image 310 may include incomplete lines of text, words, or characters. For example, as shown in Figure 3, the illustrated image 310 includes partial words 312, some of which include partial characters. When the image 310 is processed 314 using an OCR procedure, the resulting recognized electronic text 316 may include recognition errors 318 (illustrated as underlined characters). Character sequences (for example, partial words) that include recognition errors 318 may not be accurately matched with entries from an OCR dictionary, and may potentially reduce the efficiency and/or effectiveness of the comparison procedure. In embodiments, suffix arrays, regular expression matching (for example, using wildcard characters), and/or approximate string matching methods (for example, edit distance algorithms) can be used to interpret character sequences containing recognition errors 318.
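The edit distance algorithms mentioned above can be sketched as follows (an illustration, not the patented implementation); the dictionary-lookup helper and its threshold are assumptions:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def closest_word(token, dictionary, max_dist=2):
    """Interpret an OCR token as the nearest dictionary word, or None
    if nothing is within max_dist edits."""
    best = min(dictionary, key=lambda w: edit_distance(token, w))
    return best if edit_distance(token, best) <= max_dist else None
```

A token containing a recognition error can thus still be mapped to a dictionary entry, while garbage tokens fall back to None (and become candidates for pruning).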
Additionally, embodiments of the invention include eliminating 320 character sequences containing recognition errors 318 to create pruned recognized electronic text 322. The pruned recognized electronic text 322, which comprises the recognized electronic text 316 without the recognition errors 318, can be used to facilitate a comparison procedure. The comparison procedure may involve evaluating a textual similarity (rather than an identity) and, therefore, a pruned piece of recognized text may still be useful in identifying a corresponding electronic text portion. For example, one or more search queries can be used to search the electronic document for corresponding electronic text portions, and can result in the retrieval of electronic text portions that are ranked so as to allow one or more of them to be identified as corresponding electronic text portions. The identification of corresponding electronic text portions can be further facilitated, with the use of crowd-sourcing models, by presenting to a reader or reviewer one or more retrieved electronic text portions (for example, the best-ranked retrieved electronic text portions, similarly ranked electronic text portions, or the like) and requesting input from the reader or reviewer, which can be used to identify or confirm the corresponding electronic text portion. Crowd-sourcing can also be used to correct OCR recognition errors. For example, character sequences containing recognition errors 318 (and, in embodiments, surrounding character sequences) can be provided to reviewers, who can verify the OCR interpretation or provide suggestions for a correct interpretation of the character sequences.
In accordance with embodiments of the invention, the comparison component 204 may use various procedures, classifiers, and the like to determine which fractions of a recognized piece of text to prune. For example, recognition errors may appear near the boundaries of a recognized text portion, and recognized words near one or more of the boundaries that are not found in the OCR dictionary can be pruned. Additionally, the image of the printed document portion can be analyzed to determine which regions of the image include printed text, which can also be useful to facilitate pruning procedures.
Returning to Figure 2, the synchronizer 126 may include a digital annotation component 206 that facilitates the insertion of a digital annotation 220 in the electronic document 106 (or a copy 104 thereof). The insertion of a digital annotation 220 in an electronic document 106 or copy 104 may include, for example, the insertion of electronic text (for example, as a direct digital annotation provided by a reader, as a migrated digital annotation, or the like), the addition of code corresponding to the digital annotation 220 to an extensible markup language file, the association of an annotation file with an electronic document file, the incorporation of the digital annotation 220 into an electronic document file, or the like. The digital annotation component 206 can facilitate the insertion of the digital annotation 220 by inserting the digital annotation 220 into the electronic document 106 (and/or a copy 104 thereof), which can then be provided to the reading device 108.
In accordance with embodiments, the digital annotation component 206 inserts the digital annotation 220 by accessing the electronic text of the electronic document with the use of an application programming interface (API). For example, the digital annotation component 206 can insert a digital annotation into a PDF document using the Adobe® Developer API, available from Adobe Systems, Inc., of San Jose, California. The digital annotation 220 can be inserted into an HTML-based electronic document by providing an HTML overlay, or by providing a metadata schema or file and inserting, in the HTML file, a pointer to the metadata schema or file. The digital annotation 220 can be generated using a handwriting recognition procedure and can be inserted into the electronic document as searchable electronic text. Additionally, the digital annotation component 206 can facilitate the insertion of the digital annotation 220 by providing the digital annotation 220 to the electronic book reader component 118, which inserts the digital annotation 220 in the copy 104 of the electronic document 106. The digital annotation component 206 can provide an instruction that is communicated to the electronic book reader component 118 and that causes the electronic book reader component 118 to insert the digital annotation 220 in the copy 104 of the electronic document 106.
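The metadata-file-plus-pointer variant described above might look like the following sketch (illustrative only; the `rel` value, file name, and JSON payload shape are assumptions, not a standard):

```python
import json

def attach_annotation_metadata(html, annotations, meta_href="annotations.json"):
    """Serialize detected annotations to a metadata payload and insert a
    <link> pointer into the HTML head, so the annotations travel with the
    electronic document without altering its body text."""
    payload = json.dumps({"annotations": annotations}, indent=2)
    pointer = '<link rel="annotations" href="%s">' % meta_href
    # Insert the pointer just before the closing head tag.
    html = html.replace("</head>", pointer + "\n</head>", 1)
    return html, payload
```

An HTML overlay approach would instead render the annotations as positioned elements; the pointer approach keeps the document and its annotations loosely coupled.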
As described above, a synchronizer can facilitate the navigation of an electronic document and/or the synchronization of annotations between a printed document and an electronic document. Figure 4 is a flowchart depicting an illustrative method 400 of synchronizing annotations between a printed document and an electronic document, in accordance with embodiments of the present invention. Embodiments of method 400 include receiving an image of a printed document portion (block 410). The printed document portion may include an annotation in the vicinity of a printed text portion. A synchronizer (for example, the synchronizer 126 shown in Figure 1) can receive the image from a reading device (for example, the reading device 108 shown in Figure 1), a scanning device (for example, the scanning device 130 shown in Figure 1), or the like, and may store the image in a memory (for example, the memory 124 shown in Figure 1).
The synchronizer can retrieve the image and an electronic document from memory (block 420). The synchronizer can use language modeling techniques, OCR, or the like to identify the electronic document that corresponds to the printed text portion. This may include the use of a language classifier to determine the language in which the book is written. In embodiments of method 400, the synchronizer identifies an electronic text portion within the electronic document that corresponds to the printed text portion (block 430). For example, the synchronizer may include a comparison component (for example, the comparison component 204 shown in Figure 2) that identifies an electronic text portion that is textually similar to the printed text portion. The comparison component can carry out a search within the electronic document in which lines of the printed text portion are expression queries applied against an indexed form of the electronic document. The indexed electronic document may include, for example, suffix arrays. Multiple expression queries that retrieve the best-ranked results can be used to identify the corresponding electronic text portion.
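A minimal sketch of the suffix-array index mentioned above (illustrative; a production index would use a linear-time construction and line-level granularity):

```python
def build_suffix_array(text):
    """Sorted start positions of every suffix of the document text.
    O(n^2 log n) construction; fine for a sketch, not for large documents."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, query):
    """Binary-search the suffix array for every position where query occurs."""
    # Lower bound: first suffix >= query.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:] < query:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose len(query)-prefix exceeds query.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(query)] > query:
            hi = mid
        else:
            lo = mid + 1
    return sorted(sa[start:lo])
```

Each line of the recognized printed text can then be run as a query against the index, and the windows of the electronic document hit most often can be ranked as candidate matches.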
Embodiments of method 400 also include detecting an annotation in the printed document portion (block 440). For example, a detection component (for example, the detection component 202 shown in Figure 2) can detect the annotation by analyzing an image of the printed document portion. Annotations can be detected using OCR procedures, handwriting recognition procedures, classifiers, or a combination thereof, as described in more detail below with reference to Figure 5. As shown in Figure 4, a digital annotation corresponding to the detected annotation is inserted in the electronic document and/or in a copy of the electronic document (block 450). As described above, the digital annotation can be inserted into the electronic document by means of the synchronizer, or it can be inserted into a copy of the electronic document by means of an electronic book reader component (for example, the electronic book reader component 118 shown in Figure 1). The digital annotation can be inserted in the vicinity of the corresponding electronic text portion, in a position that corresponds at least substantially to the position of the annotation detected in the printed document portion.
Figure 5 is a flowchart depicting an illustrative method 500 of detecting an annotation in a printed document portion by analyzing an image of the printed document portion, in accordance with embodiments of the present invention. Embodiments of the illustrative method 500 include defining a candidate zone (block 510), for example, by dividing the image into several zones, which may include text zones, candidate zones, and the like. In some cases, text zones may also be candidate zones (for example, where annotations may be present in or between lines of printed text). A detection component (for example, the detection component 202 shown in Figure 2) can divide the image into zones based on a geometric pattern (for example, rectangular), based on regions of the image that contain homogeneous characteristics, or the like. One or more statistical classifiers can be applied to distinguish between regions of the image that are more likely to include printed text and regions that are more likely to contain annotations. Embodiments of the invention then include executing one or more of the following steps (520 to 580) for each defined zone.
Embodiments of method 500 include executing a handwriting recognition procedure (block 520) on a candidate zone, which can be used, for example, to detect handwritten annotations in the candidate zone. Additionally, handwritten text in text zones (for example, between lines of printed text) can be identified using classifiers that distinguish between handwriting and printed text. Examples of such classifiers include support vector machine (SVM) classifiers, k-nearest neighbor (k-NN) classifiers, Fisher discriminants, neural networks, minimum distance classifiers, and the like.
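Of the classifier families listed above, the minimum distance (nearest-centroid) classifier is simple enough to sketch in full; the two-dimensional feature vectors here are illustrative stand-ins for real zone features such as stroke-width variance or baseline regularity:

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

class MinimumDistanceClassifier:
    """Nearest-centroid classifier: label a zone's feature vector by the
    closest class centroid learned from labeled examples."""
    def fit(self, samples):
        # samples: {label: [feature_vector, ...]}
        self.centroids = {label: centroid(vecs) for label, vecs in samples.items()}
        return self

    def predict(self, vec):
        return min(self.centroids,
                   key=lambda label: math.dist(vec, self.centroids[label]))
```

An SVM or neural network would draw more flexible boundaries, but the fit/predict shape of the component is the same.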
An OCR procedure can be executed on the candidate zone (block 530). The OCR procedure can be used to extract features from the candidate zone and to determine whether the extracted features are associated with printed text. If the OCR procedure does not result in the detection of any printed text, it can be inferred that the candidate zone may contain annotations. Similarly, if the OCR procedure detects printed text in only a fraction of the candidate zone, it can be inferred that other fractions of the candidate zone may contain annotations. The OCR procedure may include, for example, feature extraction, matrix matching, or a combination thereof.
Embodiments of method 500 include applying a character-level statistical language model (block 540) to features extracted by the OCR procedure. The character-level statistical language model can be used to calculate a probability that an extracted feature includes a character sequence that is typical of a particular language, for example, P(sc), where sc is a character sequence. Additionally, a word-level statistical language model can be applied (block 550) to extracted features and can be used, for example, to calculate a probability that an extracted feature includes a word sequence typical of a particular language, for example, P(sw), where sw is a sequence of words. Language models can be applied to facilitate the determination of whether the extracted features are likely associated with printed text.
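A character-level model such as P(sc) above might be sketched as a smoothed bigram model (an illustration; the training corpus, padding symbol, and add-one smoothing are assumptions):

```python
import math
from collections import Counter

def train_bigram_model(corpus):
    """Return a log-probability function over character sequences, built
    from character-bigram counts with add-one smoothing."""
    bigrams = Counter()
    unigrams = Counter()
    for text in corpus:
        padded = "^" + text.lower()  # '^' marks sequence start
        for a, b in zip(padded, padded[1:]):
            bigrams[a, b] += 1
            unigrams[a] += 1
    vocab = len(set(unigrams)) + 1

    def log_p(sequence):
        padded = "^" + sequence.lower()
        return sum(math.log((bigrams[a, b] + 1) / (unigrams[a] + vocab))
                   for a, b in zip(padded, padded[1:]))

    return log_p
```

A zone whose recognized characters score much lower than typical printed text under such a model is a hint that the zone holds something other than printed text, such as an annotation.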
Embodiments of method 500 also include determining color information associated with the candidate zone (block 560). For example, the image of the printed document portion can be a color photo, and the detection component can analyze the photo to identify each pixel that includes a color (for example, other than black or white), the color of each pixel, color characteristics of the pixels (for example, hue, saturation, intensity), and the like. Thus, for example, color regions in a candidate zone can be detected and used to facilitate the detection of an annotation and/or the identification of an annotation type corresponding to a detected annotation. For example, a relatively square region of yellow pixels in a candidate zone defined in a margin can be identified as a yellow Post-it note, while a narrower rectangular region of pink pixels in a text zone can be identified as a highlight. Supervised machine learning methods can be used to distinguish between different types of annotations (for example, Post-it notes versus handwritten marginal notes versus green highlighter marks).
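The square-yellow-versus-narrow-pink heuristic above can be sketched directly; the hue ranges (degrees) and aspect-ratio thresholds are illustrative assumptions that a trained classifier would replace:

```python
def classify_color_region(width, height, hue_deg, overlaps_text):
    """Guess an annotation type from a detected homogeneous color region:
    roughly square yellow regions in margins suggest Post-it notes, thin
    pink bands over printed text suggest highlighter marks."""
    aspect = width / height if height else float("inf")
    yellow = 40 <= hue_deg <= 70
    pink = 300 <= hue_deg <= 345
    if yellow and 0.5 <= aspect <= 2.0 and not overlaps_text:
        return "post-it"
    if pink and aspect > 3.0 and overlaps_text:
        return "highlight"
    return "unknown"
```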
Embodiments of method 500 also include comparing the image with a corresponding electronic document portion (block 570). A bitmap of the image can be compared with a bitmap of the corresponding electronic document portion to assess differences between the pixels of the two bitmaps. For example, a detection component can superimpose the bitmap of the image on the bitmap of the electronic document portion and realign the superimposed bitmaps until a minimum number of overlapping pixel differences is obtained. Remaining pixel differences (for example, pixels that appear in the bitmap of the image but not in the bitmap of the electronic document, or vice versa) may represent annotations. Additionally, noise thresholds can be used to ignore small differences (for example, 1 or 2 disconnected pixels) that may represent, for example, dust particles, camera lens defects, or the like. As with any of the other procedures described herein, the bitmap comparison procedure can be improved using machine learning techniques.
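A toy version of the overlay comparison (illustrative; real images would also need scaling, rotation, and binarization, and the shift range and noise threshold are assumptions):

```python
def overlay_diff(a, b, max_shift=2, noise_threshold=2):
    """Align bitmap a (list of 0/1 rows) against b over small x/y shifts,
    keep the alignment with the fewest differing pixels, and report whether
    the remaining differences exceed a noise threshold (i.e. may be an
    annotation rather than dust or lens defects)."""
    h, w = len(a), len(a[0])

    def diff_at(dy, dx):
        count = 0
        for y in range(h):
            for x in range(w):
                by, bx = y + dy, x + dx
                bv = b[by][bx] if 0 <= by < h and 0 <= bx < w else 0
                count += a[y][x] != bv
        return count

    best = min(diff_at(dy, dx)
               for dy in range(-max_shift, max_shift + 1)
               for dx in range(-max_shift, max_shift + 1))
    return best, best > noise_threshold
```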
Embodiments of method 500 include classifying the candidate zone (block 580). For example, a binary statistical classifier can be used to classify the candidate zone as containing an annotation or not. The classifier may consider input features such as information generated by one or more of the steps of embodiments of method 500 described above, such as, for example, P(sc), P(sw), color information, pixel differences in the bitmap superposition, the dimensions of the candidate zone, and the position of the candidate zone with respect to a margin of the printed page. For example, the classifier can take, as inputs, x, y, w, and h, where x is the horizontal distance of the candidate zone from the page margin, y is the vertical distance of the candidate zone from the page margin, w is the width of the candidate zone, and h is the height of the zone, all of which can be determined during zoning. In embodiments, the classifier may adopt, as input, an unlimited number of different types of information in addition to, or instead of, those previously described. Additionally, the classifier can be trained, using an unlimited number of machine learning techniques, in order to improve its ability to classify zones. Examples of features that can be used to train the classifier include color, hue, saturation, underlining, handwriting, fonts, margin positions, pixel noise, and the like.
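The geometric inputs x, y, w, and h described above can be derived from a zone box, and a toy stand-in for the binary classifier shows how they might be used (the margin fraction and minimum-area thresholds are illustrative assumptions, not trained values):

```python
def zone_features(zone, page_width):
    """Return (x, y, w, h) for a zone box (x0, y0, x1, y1), with x and y
    measured from the top-left page corner, as determined during zoning."""
    x0, y0, x1, y1 = zone
    return (x0, y0, x1 - x0, y1 - y0)

def likely_annotation(features, page_width, margin_frac=0.15):
    """Toy binary decision: flag zones that sit entirely in an outer margin
    and are big enough to hold writing. A trained statistical classifier
    would replace these hand-set thresholds."""
    x, y, w, h = features
    in_margin = (x + w <= margin_frac * page_width
                 or x >= (1 - margin_frac) * page_width)
    big_enough = w * h >= 100
    return in_margin and big_enough
```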
Although embodiments of the present invention have been described with specificity, the description itself is not intended to limit the scope of this patent. On the contrary, the inventors have contemplated that the claimed invention could also be embodied in other ways, to include different steps or features, or combinations of steps or features similar to those described herein, in conjunction with other technologies.
Claims (20)
[1]
1. Method implemented by computer to synchronize annotations between a printed document and an electronic document, characterized in that it comprises:
receive, at a server, an image of a printed document portion having a printed text portion, the printed document portion including an annotation in the vicinity of the printed text portion;
access at least part of the electronic document from a memory device, the electronic document comprising an electronic version of the printed document;
identify an electronic text portion within the electronic document that corresponds to the printed text portion;
detect the annotation in the printed document portion; and
facilitate the insertion of a digital annotation in at least one of the electronic document and a copy of the electronic document, the digital annotation corresponding to the detected annotation, the digital annotation being inserted in the vicinity of the identified electronic text portion, in a position that corresponds at least substantially to a position of the annotation detected in the printed document portion.
[2]
2. Method according to claim 1, further comprising:
identify an additional digital annotation in the electronic document;
determine that the printed document portion does not include an additional annotation corresponding to the additional digital annotation;
determine whether the additional digital annotation is a migrated annotation; and
delete the additional digital annotation from the electronic document if the additional digital annotation is a migrated annotation.
[3]
3. Method according to claim 2, wherein determining whether the additional digital annotation is a migrated annotation comprises determining a value of a tag associated with the additional digital annotation.
[4]
4. Method according to claim 1, further comprising:
identify an annotation type corresponding to the detected annotation, the annotation type comprising at least one of a highlighted part, handwritten text, an underlined part, and a bookmark, wherein facilitating the insertion of the digital annotation in the electronic document comprises creating a digital annotation of the identified annotation type.
[5]
5. Method according to claim 1, wherein the facilitation of the insertion of the digital annotation in the electronic document comprises inserting electronic text in a margin of the electronic document.
[6]
6. Method according to claim 1, wherein the electronic document comprises a platform-independent document format.
[7]
7. Method according to claim 1, wherein the identification of the corresponding electronic text part comprises identifying a textual similarity between the printed text part and the corresponding electronic text part.
[8]
8. Method according to claim 7, wherein the identification of the textual similarity comprises:
recognize a fraction of the printed text part by performing an optical character recognition (OCR) procedure on the image of the printed document part;
convert the recognized fraction of the part of printed text into recognized electronic text using the OCR procedure;
identify at least one sequence of characters in the recognized electronic text, which includes a recognition error;
create pruned recognized electronic text, the pruned recognized electronic text comprising the recognized electronic text from which said at least one character sequence has been removed; and
search the electronic document using at least one search query that includes the pruned recognized electronic text.
[9]
9. One or more computer-readable media, characterized in that they comprise computer-executable instructions embodied therein to facilitate synchronization of annotations between a printed document and an electronic document, the instructions including a plurality of program components, the plurality of program components comprising:
a comparison component that (1) receives an image of a printed document portion having a printed text portion and (2) identifies an electronic text portion within the electronic document that corresponds to the printed text portion; and
a digital annotation component that facilitates the insertion of a digital annotation in the electronic document in the vicinity of the corresponding electronic text portion identified.
[10]
10. Media according to claim 9, wherein the comparison component uses an optical character recognition (OCR) procedure to recognize the printed text portion within the printed document portion, and uses one or more search queries to identify the corresponding electronic text portion.
[11]
11. Media according to claim 10, wherein the comparison component uses a pruning procedure to eliminate recognition errors from the recognized printed text portion.
[12]
12. Media according to claim 9, wherein the digital annotation comprises at least one of a direct digital annotation and a migrated digital annotation.
[13]
13. Media according to claim 12, further comprising:
a detection component that detects an annotation in the printed document portion by analyzing at least the image of the printed document portion, the printed document portion including the annotation in the vicinity of the printed text portion.
[14]
14. Media according to claim 13, wherein the detection component divides the image of the printed document portion into a plurality of zones, the plurality of zones comprising at least one text zone and at least one candidate zone.
[15]
15. Media according to claim 14, wherein the detection component analyzes said at least one candidate zone using at least one of a handwriting recognition procedure, an OCR procedure, a statistical language model, a bitmap superposition comparison, and a statistical classifier.
[16]
16. Media according to claim 13, wherein the detection component detects the annotation based, at least in part, on feedback received from at least one reviewer by means of a crowd-sourcing platform.
[17]
17. System that facilitates the synchronization of annotations between a printed document and an electronic document, characterized in that it comprises:
a server configured to receive, from an image forming device, an image of a part of a printed document that has an annotation in the vicinity of a part of printed text, the server comprising a processor that instantiates a synchronizer configured to:
(a) identify a corresponding electronic text part in the electronic document, the corresponding electronic text part being textually similar to the printed text part,
(b) detect the annotation in the part of the printed document, and
(c) facilitate the insertion of a digital annotation in the electronic document, the digital annotation corresponding to the annotation detected.
[18]
18. System according to claim 17, wherein the synchronizer is configured to detect the annotation using at least one of a handwriting recognition procedure, an OCR procedure, a statistical language model, a bitmap superposition comparison, and a statistical classifier.
[19]
19. System according to claim 17, wherein the synchronizer is configured to facilitate insertion of the digital annotation in the electronic document (1) by associating with the electronic document a metadata file containing the annotation, and (2) by inserting in the electronic document a pointer to the metadata file.
[20]
20. System according to claim 17, wherein the image forming device comprises at least one of a camera arranged in a reading device and an industrial scanning device.
Patent family:
Publication number | Publication date
AU2014223441B2|2017-04-06|
ES2555180R1|2016-02-05|
GB201513239D0|2015-09-09|
US20140245123A1|2014-08-28|
AU2014223441A1|2015-08-20|
WO2014134264A1|2014-09-04|
GB2525787B|2021-04-21|
ES2555180B1|2016-10-19|
US9436665B2|2016-09-06|
GB2525787A|2015-11-04|
Legal status:
2016-10-19 | FG2A | Definitive protection | Ref document: ES2555180 B1 | Effective date: 2016-10-19
2018-05-24 | PC2A | Transfer of patent | Owner: THOMSON REUTERS GLOBAL RESOURCES UNLIMITED COMPANY | Effective date: 2018-05-24
2020-03-31 | PC2A | Transfer of patent | Owner: THOMSON REUTERS ENTERPRISE CENTRE GMBH | Effective date: 2020-03-25
Priority:
Application number | Filing date | Publication | Title
US 13/781,446 | 2013-02-28 | US9436665B2 | Synchronizing annotations between printed documents and electronic documents